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Abstract 


Background: Artificial intelligence (AI) has the potential to personalise mechanical ventilation strategies for patients with 
respiratory failure. However, current methodological deficiencies could limit clinical impact. We identified common 
limitations and propose potential solutions to facilitate translation of AI to mechanical ventilation of patients. 
Methods: A systematic review was conducted in MEDLINE, Embase, and PubMed Central to February 2021. Studies 
investigating the application of AI to patients undergoing mechanical ventilation were included. Algorithm design and 
adherence to reporting standards were assessed with a rubric combining published guidelines, satisfying the Trans- 
parent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis [TRIPOD] statement. Risk of 
bias was assessed by using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), and correspondence with 
authors to assess data and code availability. 

Results: Our search identified 1,342 studies, of which 95 were included: 84 had single-centre, retrospective study design, 
with only one randomised controlled trial. Access to data sets and code was severely limited (unavailable in 85% and 87% 
of studies, respectively). On request, data and code were made available from 12 and 10 authors, respectively, from a list 
of 54 studies published in the last 5 yr. Ethnicity was frequently under-reported 18/95 (19%), as was model calibration 17/ 
95 (18%). The risk of bias was high in 89% (85/95) of the studies, especially because of analysis bias. 

Conclusions: Development of algorithms should involve prospective and external validation, with greater code and data 
availability to improve confidence in and translation of this promising approach. 

Trial registration number: PROSPERO — CRD42021225918. 
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Editor’s key points 


e This systematic review was conducted to identify 
common limitations and potential solutions to the 
translation of artificial intelligence to mechanical 
ventilation of patients. 

e Of 1,342 studies identified, 95 studies were included, 

most of which were single-centre retrospective 

studies with limited access to data sets and code, and 
high risk of bias. 

Artificial intelligence applied to mechanical ventila- 

tion has limited external validation and model cali- 

bration, with substantial risk of bias, significant gaps 
in reporting, and poor code and data availability. 

e Future development of algorithms should involve 
prospective and external validation, with rigorous 
adherence to standards, and code and data avail- 
ability to facilitate translation of data science into 
improved approaches to mechanical ventilation. 


Invasive mechanical ventilation is a vital supportive therapy 
for critically ill patients with respiratory failure.’ These pa- 
tients are heterogeneous in terms of disease aetiology, lung 
pathology, and respiratory mechanics; ^ ventilation strate- 
gies based on current guidelines do not guarantee lung pro- 
tection to individual patients, and inappropriate settings can 
increase morbidity and mortality.” ” The best available evi- 
dence is based on results from a few clinical trials, where 
outcomes are identified at a population level and significant 
heterogeneity exists between individual patients.®? The risk of 
applying population data to single patients has been high- 
lighted by secondary analyses of the trial data. However, the 
experimental evidence to guide personalised ventilation 
strategies is lacking. 

Artificial intelligence (AI) could offer a solution in the 
pursuit towards personalised mechanical ventilation, capi- 
talising on the widespread use of electronic monitoring and 
recording in high-income countries and the associated 
wealth of data generated during mechanical ventilation of 
intraoperative and ICU patients. The capacity of AI to predict 
sepsis," circulatory failure,” and patient mortality’? dem- 
onstrates the potential for further applications in the 
anaesthesia and critical care settings. However, the current 
AI landscape is increasingly acknowledged as incomplete, 
with frequent concerns regarding AI reproducibility and 
generalisability,‘* 1° and methodological limitations during 
algorithm design and validation can limit transferability and 
clinical translation. Reporting guidelines have been devel- 
oped to improve standards of publication and to reduce the 
potential for bias. However, these parameters have not been 
evaluated for AI specifically in mechanical ventilation.” 

We aimed to identify common methodological limitations 
and propose changes in publication standards that would 
enable translation of AI algorithms into clinical practice to 
support mechanical ventilation. Therefore, we conducted a 
systematic review of the AI literature as applied to mechanical 
ventilation, evaluating adherence to the Transparent Report- 
ing of a multivariable prediction model for Individual Prog- 
nosis Or Diagnosis (TRIPOD) statement, risk of bias using the 
Prediction model Risk Of Bias ASsessment Tool (PROBAST), 
algorithm design using a novel rubric, and the availability of 
both data and code.**1° 


Methods 


The systematic review protocol was registered with the online 
International Prospective Register of Systematic Reviews 
database (CRD42021225918) before search execution; de- 
viations are reported in Supplementary file 1. This study was 
created in accordance with the Preferred Reporting Items for 
Systematic Reviews and Meta-Analyses (PRISMA) guidelines,”° 
with a checklist displayed in Supplementary file 2. No insti- 
tutional ethical approval was required. 


Study identification and inclusion criteria 


A comprehensive search strategy was performed using free 
text and MeSH terms of various forms of the keywords ‘arti- 
ficial intelligence’ and ‘mechanical ventilation’; the full search 
strategy is outlined in Supplementary file 3. Three major 
electronic databases (MEDLINE, Embase, and PubMed) were 
searched from database creation until February 26, 2021, with 
no additional filters used. Additional articles were identified 
from references of included studies. Inclusion and exclusion 
criteria are detailed in Table 1. 


Study selection and data extraction 


All texts were screened by one author (JG), and subsets were 
also screened by a second author (FF, MDPAL, or JZ) so that 
each text was screened twice independently. Disagreements 
emerged from this independent screening were resolved by 
consensus. Data were systematically extracted from each 
study into a predesigned spreadsheet and analysed post hoc 
using pivot tables. 


Adherence to standards 


Reporting quality was assessed using a rubric generated by 
MIT Critical Data, led by the Laboratory for Computational 
Physiology, combining published guidelines (Supplementary 
file 4). This approach satisfies the TRIPOD statement,” ‘a 
checklist that aims to increase transparency of publications 
developing predictive models for medicine’.2) Further 
emphasis is provided on the impact of generalisability and 
assessment of the availability of data and code, which are 
fundamental for reproducibility in this discipline. 

In addition, the authors of included studies published since 
January 1, 2016 (n=54) for which code or data were not publicly 
available were e-mailed for access. The time frame was 
limited to the past 5 yr based upon the significant increase in 
publications over this period and to identify current barriers to 
sharing data sets, not to determine whether data are kept for 
prolonged periods of time. Risk of bias was assessed using 
PROBAST.’® The overall group adherence was calculated as a 
percentage, and adherence groups created based upon 
reporting of 0—33.3%, 33.4—-66.6%, or 66.7—100%. Descriptive 
statistics were calculated as mean (standard deviation [sp]) for 
continuous variables, and frequencies and rates for categori- 
cal data. No formal quantitative synthesis was performed 
because of the heterogeneity in outcomes studied, models 
used, and reporting levels. 


AI maturity 


There is no agreed upon framework to determine the maturity 
of clinical AI literature.*”?? We assessed the AI maturity of 
each study using the US Food and Drug Administration (FDA) 
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Table 1 Inclusion and exclusion criteria for study selection. “Invasive mechanical ventilation is defined as a mechanism of respiratory 
support delivered to a patient with a tracheal tube. Paediatric patients were included to provide a comprehensive assessment of the 
literature. 'Artificial intelligence algorithms were defined as computational programmes with the capacity to learn, evaluate their own 
performance, and update their rules, to facilitate a prediction output. ‘Human clinicians had to be equivalent level to board certifi- 
cation/completion of specialty training to be considered expert. Al, artificial intelligence; NA, not applicable. 


Inclusion criteria 


Exclusion criteria 


Non-human participants (animals or modelling 


Participants Participants (adult and paediatric) undergoing 
invasive mechanical ventilation* 
Intervention Assess application of Al! to patient data 


Investigation of outcomes directly related to 


data generated algorithmically) 


Application of AI to inform internal ventilator 
operation, increase clinical imaging 


mechanical ventilation 


Comparator (if presented) 


Clinicians? or previously validated models 


techniques resolution, or identify clinical 
anatomy 
Primary use of neural signals or genetic/ 
molecular markers as candidate predictors 
Models predicting successful intubation or 
investigating obstructive sleep apnoea or 
noninvasive ventilation 


NA 


Study type Published peer-reviewed scientific reports in Review articles, commentaries, letters, 
English until the search date editorials, and other informal publication 
types 
Outcome Model performance NA 
Patient-related outcomes (e.g. mortality and 
ventilation requirement) 
Context Hospitals with critical care facilities that NA 


manage patients undergoing invasive 


mechanical ventilation 


Data generated in this setting, in the context of 
routine clinical care, or generated in the 


context of prospective research 


guidelines and expert opinion.” Studies were categorised into 
the following groups: (i) ‘maths into algorithm’ (authors pro- 
pose new algorithmic construct with some feasibility testing), 
(ii) ‘algorithm into model’ (authors develop a new model based 
on retrospective data with some prospective feasibility 
testing), (iii) ‘model into device’ (validation of a previously 
developed model, ideally against an existing gold standard), 
and (iv) ‘device into practice’ (prospective clinical evaluation 
of deployed devices). 


Results 
Study selection 


We identified 1,342 studies, including 434 duplicates, resulting 
in 908 studies for which abstracts were screened (Fig. 1). After 
abstract and full-text screening, a total of 95 studies were 
included in the systematic review; a full list of included and 
excluded studies can be found in Supplementary file 5. 


Study characteristics 


Several studies were single-centre studies (53/94; 56%) (one not 
reported), primarily with a retrospective design (84/95; 88%), 
few were conducted prospectively (11/95; 12%), and only one 
RCT. The five countries with the highest number of published 
studies were the USA (34/95; 36%), Spain (13/95; 14%), China (7/ 
95; 7%), Brazil (5/95; 5%), and UK (5/95; 5%), together repre- 
senting 67% of the included studies; one study did not define 
the country of the studied population. 


Three studies recruited neonatal patients (mean age 22 
weeks); the remaining studies had a population age (mean [sp]) 
of 61 [8] yr (64/95 reported). A total of 58 studies (61%) reported 
the sex of participants (61% male; 39% female). Only 18/95 
(19%) studies reported at least part of the ethnic distribution of 
the cohort: overall, the mean distribution [sp] was 57 [17]% 
White, 26 [20]% Black, 29 [23]% Asian, and 5 [2]% Hispanic; 
studies may have reported 81% for a single ethnicity and no 
details of other ethnicities, so percentages do not sum to 100%. 

The number of publications by year increased significantly 
from 2016 to 2020, with 11 studies publishing new models in 
the first 2 months of 2021 alone (Fig. 2). However, progression 
towards maturity was limited, as no ‘device into practice’ 
technology was identified, despite the increasing rate of ‘al- 
gorithm into model’ studies published in recent years. 


Adherence to standards 


Table 2 presents percentage adherence of relevant outcomes; 
notably, there were very limited freely available code and data, 
available in only 13% and 15%, respectively. We e-mailed 54 
authors for access to code and data: one e-mail was auto- 
matically returned because of a non-existent account; 12 and 
10 were able to provide access respectively to code and data on 
request; three were not able to comply because of ethical or 
regulatory reasons; and 37 did not respond. 


Risk of bias 


Overall, 10 studies were classified as low risk, whereas the 
remaining 85 were classified as high risk of bias (Fig. 3). The 
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Records identified through 
database searching 
(n=1338) 


Additional records identified 
through other sources 


(n=4) 





Records after duplicates removed 


(n=908) 


Abstracts screened 
(n=908) 


Full-text articles assessed 


for eligibility 
(n=178) 


Studies included in 
qualitative synthesis 
(n=95) 


Excluded (n=730) 
e Not investigate MV (n=397) 


e Publication type (n=159) 

e Intubation success (n=40) 
e OSA/CPAPINIV (n=34) 

e Not investigate Al (n=33) 
e Other (n=67) 


Excluded (n=83) 

e No patient data (n=22) 

e Not investigate MV (n=21) 
e Not investigate Al (n=20) 
e Other (n=19) 





Fig 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow chart of study records. The full list of screened 
papers is available in Supplementary file 5. AI, artificial intelligence; MV, mechanical ventilation; NIV, noninvasive ventilation; OSA, 


obstructive sleep apnoea. 


majority were deemed low risk in participants (84/95), pre- 
dictors (88/95), and outcome analysis (95/95), but 85/95 were 
deemed high risk of bias in the analysis section. This high risk 
of bias was attributable to inadequate reporting of missing 
data handling (40/95) and insufficient detail on model perfor- 
mance (79/95), where the majority failed to mention calibra- 
tion (78/95). 


Clinical outcomes 


The four most published prediction outcome categories were 
predicting weaning success (23/95; 24%), predicting 
commencement of mechanical ventilation (i. e. within 24 h) 
(22/95; 23%), predicting a complication in ventilated patients 
(18/95; 19%), and detecting patient—ventilator asynchrony (12/ 
95; 13%); only 6% (6/95) reported AI to inform a clinical decision 
support system. The most common models used were neural 
networks (30/95; 32%), decision trees (28/95; 29%), and clus- 
tering algorithms (10/99; 11%). 

The majority of studies (75/95; 79%) did not compare the 
results of the model developed or validated with either clini- 
cian performance, a previously validated model, or standard 
index (e.g. early warning score). The same majority conducted 
analysis on cohort results, without any individualised 
approach like bedside support systems. 


Discussion 


To our knowledge, this is the first systematic review con- 
ducted specifically in the field of AI for mechanical ventilation. 
The exponentially growing body of AI literature has largely 
attempted to predict onset or aid mechanical ventilation 
weaning, with less focus on clinical decision support. Meth- 
odological deficiencies were frequent and changes in practice 
are necessary to ensure development of a robust evidence 
base in this emerging field. 


External validation is critical to protect patients and 
trust in AI algorithms 


A significant majority of retrospective analysis was observed, 
with only one RCT.” This finding has been observed in other 
domains and limits the potential for translation to clinical 
care Se 

Despite greatly improved internal validation techniques, 
there is still limited external validation, and few multicentre 
studies exist, which limit both confidence in the validity, and 
applicability to different geographical regions. A dispropor- 
tionately large contribution comes from a few countries (e. g. 
USA; 34/95) and in some cases cities (e. g. Boston, MA; 8/95). 
This skewed geographical distribution limits the algorithms’ 
applicability to demographically and genotypically different 


Artificial intelligence for mechanical ventilation | 5 





m Device into practice 

Hi Model into device 

H Algorithm into model 

o Mathematics into algorithm 


25 





20 
15 


10 


o M -0C 


Number of published studies 











J 








1990-9* 2000-9* 2010 2011 


2012 2013 2014 2015 2016 2017 2018 2019 2020 
Year 


Fig 2. Temporal distribution illustrating the number and maturity of studies included in this review. The average number of publications 
per year is presented for two decades (1990—9* and 2000—9*) because of the limited number of relevant publications in that period; the 
highest number of publications in any 1 yr during that period was three (2002 and 2009). There has been a significant rise in the number of 
artificial intelligence algorithms being used to develop models applied to mechanical ventilation in the last 5 yr; 55 since January 2016. 
However, there has been no consistent shift towards device creation and subsequent deployment and evaluation in clinical practice. 


populations, and to practices that differ between health sys- 
tems, such as criteria for ICU admission and intubation, which 
may not be accounted for in the algorithms.° These factors 
will affect the distribution of features in the training and 
validation cohorts, introducing collider bias and leading to 
spurious associations that find their way into the algorithms.” 
A ‘collider’ refers to an independent third variable caused by 


an exposure and outcome; collider bias is the distorted asso- 
ciation between the exposure and outcome when controlling a 
collider.” This has important implications for the sampling 
methods used in data science and in database formation. A 
potential benefit of benchmark data sets that represent 
diverse patient cohorts is that they could allow for evaluation 
of generalisability and recalibration of algorithms.*° 


Table 2 Adherence to reporting standards. The respective Transparent Reporting of a multivariable prediction model for Individual 
Prognosis Or Diagnosis (TRIPOD) reference is displayed in round brackets and the percentage adherence in square brackets. NA, not 


applicable. 


Bottom tertile (0—33.3) 


External validation [21%] 
Outliers (NA) [20%] 

Ethnicity split (13b) [19%] 
Calibration (16) [18%] 

Data freely available (21) [15%] 
Code freely available (21) [13%] 
Protocol (21) [4%] 


Abstract (2) [63%] 


Middle tertile (33.4—66.6) 


Outcome comparison (14a) [65%] 
Exclusion criteria (5b) [64%] 


Study dates (4b) [62%] 
Clinically relevant (16) [59%] 
Missing data (9) [58%] 

Sex split (13b) [56%] 


Upper tertile (66.7—100) 


Medical context (3a) [100%] 
Objectives (3b) [100%] 

Study design (4a) [100%] 
Prediction outcome (6a) [100%] 
Inclusion criteria (5b) [99%] 
Setting (5a) [98%] 

Sample size (8) [98%] 

Data flow/summary (13a) [98%] 
Defined predictors (7a) [97%] 
Discrimination (16) [97%] 
Internal validation (10b) [95%] 
Data pre-processing (10a) [93%] 
Funding (22) [84%] 

Full prediction model (15a) [75%] 
Mean age of sample (13b) [67%] 
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Fig 3. Risk of bias assessed using Prediction model Risk Of Bias ASsessment Tool (PROBAST) reporting standards for all included studies. 
Low indicates the percentage of studies that did not receive any high-risk rating for a particular category. High denotes those that achieved 


a high-risk rating in at least one criterion for a particular category. 


Ethnicity was largely under-reported, making comparison 
of algorithm performance when applied to different ethnic- 
ities impossible, and potentially contributing to disparities 
once implemented.’ ** If sufficient prospective and/or 
external validation has demonstrated benefits to patients and/ 
or clinicians, generalisability may be falsely assumed, but it is 
vital to ensure algorithm performance is continuously moni- 
tored, and especially its effect on health disparities. 


Calibration translates model discrimination into 
patient-level outcomes 


Model discrimination was universally reported with the most 
common index being area under the receiver operating char- 
acteristic curve. It is important that clinicians work with 
model developers to present clinically meaningful and rele- 
vant metrics, such as precision and recall. This is highlighted 
by differences in ideal sensitivity and specificity level for 
weaning patients undergoing mechanical ventilation and for 
predicting mortality. However, model calibration was consis- 
tently under-reported. Calibration has important implications 
for patient-level (rather than population-level) performance, 
and is necessary for monitoring drifts after deployment, even 
within the same centre. A model may have near-perfect 
discrimination, but if it is poorly calibrated, clinicians would 
be reasonably reluctant to adopt it. 

The most investigated outcomes were predicting weaning 
success and ventilator complications, a disproportion possibly 
attributable to the clear-cut nature of the outcomes. There was 
a predominance of neural network and random forest algo- 
rithms, where the former may represent relatively superior 
discrimination power capabilities at the expense of inter- 
pretability. All except three studies provided an explanation of 
algorithm performance and attempted to increase interpret- 
ability, but the use of tools, such as feature importance or 
SHapley Additive exPlanations, was under-utilised (3/95).** 

Translation of interpretable models to the bedside re- 
mains a significant challenge, where collaboration is required 
to refocus data science researchers towards clinically 


meaningful and translatable problems. The gap between AI 
model development and subsequent device implementation 
was evident (Fig. 2) with a lack of device creation or subse- 
quent assessment in clinical practice, despite significant in- 
crease in model development. These findings are typical of the 
broader AI literature, with a wide discrepancy between the 
large number of AI publications in 2019 (n=12,422) and the 
number of new useable algorithms receiving approval from 
the FDA for clinical use in 2020 (n=130).*»*° 


Moving beyond the ‘black box’ of algorithm 
development 


Our systematic review identified a dearth of publicly available 
data and code. Upon direct request, few authors (12/54) were 
able to share data and/or code, with the rest of responders 
unable to share data and/or code because of institutional 
ethics, risk of patient data being released, and/or confidenti- 
ality over the code to be submitted for separate peer-reviewed 
publication or patent applications, even if the majority of ar- 
ticles stated ‘Data will be made available on request’, or 
similar. This outcome highlights the need for a system leading 
to internationally achievable anonymisation and curation of 
data, with their dissemination to the research community. 
Code availability should become the rule rather the exception, 
and anonymised data should at least be available on request, 
or preferably stored in publicly available and secure sites. 

The importance of access to well-curated data is funda- 
mental for algorithm development, as is the access process 
transparency. A notable feature in the evaluated studies is the 
absence of reporting of how missing data and outliers were 
managed. Missing data are inevitable in electronic health re- 
cords, and the disproportionate loss of data between different 
subgroups or patients can skew the modelling, similarly to 
attrition bias in randomised studies. 

The COVID pandemic highlighted the limited ability of 
centres to rapidly aggregate clinical data, resulting in a num- 
ber of ‘small, incompatible data’ in contrast with the 
acclaimed ‘big, standardised data’. There is a clear need for 
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standardised structuring, formatting, and curating of collected 
data to facilitate generalisability and multicentre evaluation,” 
requiring close collaboration between clinicians, data scien- 
tists, and AI engineers throughout the life cycle of an algo- 
rithm. Reciprocal familiarisation of clinicians with Al, and of 
data scientists and AI engineers with clinical medicine, is 
required for an efficient progress in the field. 

Overall, there is a significant risk of bias in the reporting 
methodology. This bias does not appear to be specific to me- 
chanical ventilation, as it appears to be a consistent problem 
throughout the AI in healthcare domain.*® 


Future implications 


To develop more robust prediction models, cross-disciplinary 
multicentre collaboration and improved reporting of model 
performance are required. These improvements will require 
data and code availability to become standard in the field to 
prioritise reproducibility over novelty. The goal of person- 
alising ventilation strategy, and more generally healthcare, to 
individual patients will require integration of bidirectional 
data flow: from patients to algorithms for training, and from 
trained algorithms to patients, further highlighting the 
importance of merging data science with clinical medicine. 
The primary challenge in this flow of data is the availability of 
large multicentre, interoperable critical care data sets that 
contain the rich, longitudinal data required for producing 
generalisable models. 

In the short term, making anonymised patient data pub- 
licly available for international collaborative research could 
contribute to a larger data set derived from diverse patient 
cohorts, redrawing the unequal medical knowledge map, 
with benefits reaching beyond national borders and into re- 
gions with limited resources. A key benefit of a system that 
could regularly provide new data is the potential for regular 
verification and re-evaluation of patient care, an iterative 
process that is critical for AI algorithm implementation. This 
process would require a new approach to governance, where 
current institutional policies may not consider potential for 
multiregional collaboration, or solutions for rapid turnover of 
data. 

Long term, the proliferation of interoperable electronic 
health systems and data standards will result in the ability to 
accumulate higher-quality data at source and wider adoption 
of federated approaches for machine learning.” It could also 
support improved ICU care by providing evidence from more 
comprehensive data sets to be applied to the context of the 
local clinical scenario and to complement local expertise, 
reducing the bias inevitably associated with anecdotal expe- 
rience. This approach could also enable the opportunity to 
assess the diversity of management styles for evidence-based 
protocols. True interoperability may be some time off, where 
coordination between stakeholders and improved data access 
will be key steps.*° 


Study limitations 


The guidelines chosen to assess the studies (TRIPOD and 
PROBAST) were designed for prediction models only, so their 
application to assess other AI algorithms may pose limita- 
tions, for example where a few studies may have been over- 
looked. Our elaboration to these guidelines and development 
of AI maturity framework were based upon expert opinion and 
have not been tested before. The PROBAST risk of bias does 


contain a subjective component, and borderline results may 
affect overall interpretation. The decision on which studies 
directly investigated mechanical ventilation and on what 
classified as AI was associated with subjective interpretation 
and definitions. 


Conclusions 


Currently, AI applied to mechanical ventilation has limited 
external validation and model calibration, with substantial 
risk of bias, significant gaps in reporting, and poor code and 
data availability. Rigorous adherence to standards, and broad 
changes in our approach to data and reproducibility, will 
facilitate translation of algorithmic data science into deploy- 
able tools and improved patient care." 
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